# packages that we need for the notebook
require(preprocessCore)
require(sva)
require(pcaMethods)
require(gplots)
require(e1071)
source("black_box_svm.R")
Good models in machine learning require high quality data. Otherwise happens what we call garbage in, garbage out. Furthermore, prediction models need standardized data; otherwise the trained model will only work for the same particular set of data that was used for training. For these reasons, the first steps in machine learning are composed of data exploration and quality control; and are crucial to ensure that all the downstream tasks are not doomed from the start.
In this notebook we will be using a transcriptomics dataset of ColoRectal Cancer (CRC). CRC has four different subtypes (CMS1, CMS2, CMS3 and CMS4) that it can be classified to (you can read more about it here). Our objective is to train a machine learning model to classify a CRC sample into the four subtypes. For example, this would be useful if different subtypes receive different treatments, and we want to make sure that a new patient gets the best treatment possible.
We will go through a checklist of the main tasks to do before embarking in any machine learning process. This includes:
Let’s think about two characteristics about the data:
A little secret
Loading the dataset:
sample_data <- read.table(file = 'rawdata/samples_data.txt', header = TRUE, sep = '\t')
expr_data <- read.table(file = 'rawdata/expr_data.txt', header = TRUE, sep = '\t')
gene_data <- read.table(file = 'rawdata/gene_data.txt', header = TRUE, sep = '\t')
We have just read 3 tables that contain the following information:
sample_data: contains information about the different samples.expr_data: contains the actual data on gene expression, rows = genes and columns = samples.gene_data: contains information about the measured genes: common names, different ids, etc.The dimensions of expr_data tell us about the amount of genes and samples.
dim(expr_data)
## [1] 20742 56
We have 56 samples and 20742 genes.
We can take a look at the sample_data data.frame to see how does our experiment look like.
sample_data
## SampleId sample_name batch CMS
## 1 1 sample_14 1 CMS1
## 2 2 sample_17 1 CMS1
## 3 3 sample_68 1 CMS1
## 4 4 sample_71 1 CMS1
## 5 5 sample_67 1 CMS2
## 6 6 sample_69 1 CMS2
## 7 7 sample_15 1 CMS2
## 8 8 sample_72 1 CMS2
## 9 9 sample_107 1 CMS1
## 10 10 sample_70 1 CMS2
## 11 11 sample_30 1 CMS3
## 12 12 sample_29 1 CMS3
## 13 13 sample_4 1 CMS3
## 14 14 sample_31 1 CMS4
## 15 15 sample_12 1 CMS4
## 16 16 sample_33 1 CMS4
## 17 17 sample_2 1 CMS4
## 18 18 sample_16 1 CMS4
## 19 19 sample_34 1 CMS4
## 20 20 sample_14 2 CMS1
## 21 21 sample_17 2 CMS1
## 22 22 sample_68 2 CMS1
## 23 23 sample_71 2 CMS1
## 24 24 sample_67 2 CMS2
## 25 25 sample_69 2 CMS2
## 26 26 sample_108 2 CMS3
## 27 27 sample_15 2 CMS2
## 28 28 sample_72 2 CMS2
## 29 29 sample_70 2 CMS2
## 30 30 sample_30 2 CMS3
## 31 31 sample_29 2 CMS3
## 32 32 sample_4 2 CMS3
## 33 33 sample_31 2 CMS4
## 34 34 sample_12 2 CMS4
## 35 35 sample_33 2 CMS4
## 36 36 sample_2 2 CMS4
## 37 37 sample_16 2 CMS4
## 38 38 sample_34 2 CMS4
## 39 39 sample_14 3 CMS1
## 40 40 sample_17 3 CMS1
## 41 41 sample_68 3 CMS1
## 42 42 sample_71 3 CMS1
## 43 43 sample_67 3 CMS2
## 44 44 sample_69 3 CMS2
## 45 45 sample_15 3 CMS2
## 46 46 sample_72 3 CMS2
## 47 47 sample_70 3 CMS2
## 48 48 sample_30 3 CMS3
## 49 49 sample_29 3 CMS3
## 50 50 sample_4 3 CMS3
## 51 51 sample_31 3 CMS4
## 52 52 sample_12 3 CMS4
## 53 53 sample_33 3 CMS4
## 54 54 sample_2 3 CMS4
## 55 55 sample_16 3 CMS4
## 56 56 sample_34 3 CMS4
str(sample_data)
## 'data.frame': 56 obs. of 4 variables:
## $ SampleId : int 1 2 3 4 5 6 7 8 9 10 ...
## $ sample_name: Factor w/ 20 levels "sample_107","sample_108",..: 4 7 16 19 15 17 5 20 1 18 ...
## $ batch : int 1 1 1 1 1 1 1 1 1 1 ...
## $ CMS : Factor w/ 4 levels "CMS1","CMS2",..: 1 1 1 1 2 2 2 2 1 2 ...
table(sample_data$CMS)
##
## CMS1 CMS2 CMS3 CMS4
## 13 15 10 18
What can you say about the data?
There are 56 samples
Each sample is done in triplicate.
There are 3 batches.
Each replicate is in a different batch.
Not all CMS subtypes are represented equally.
Normalization is a very broad term, it could also include data scaling and batch correction. For this notebook we will keep them separate and define normalization as the process to ensure that all the different samples are comparable between them.
Again, think about the type of data that we are working with and about the design of the experiment. Some examples:
Different types of data -> different approaches for normalization
Different types of experiments -> different approaches for normalization
This is microarray data that has already been background corrected using the RMA method, moreover the data is log2 transformed.
A good starting point is to plot the different samples using boxplots. This kind of visualization can give as a general overview of what is the data distribution in each sample.
boxplot(expr_data)
This plot already shows some interesting distributions. What do you see?
There are batch effects. These kind of batch effects should be already known in advance. It is important to know how the samples were processed and annotate which samples belong to which batch.
There are two samples that have a much lower overall gene expression, perhaps something went wrong with them.
Overall all most of the samples have a similar distribution with slighly different means.
Because normalization takes into account data from all the samples it is important to make sure that all the samples are actually good. At first glance there seems to be two bad samples. Therefore, we have to make a decision on what to do with them:
What can we do to explore a bit better these samples? Any ideas on what might be wrong with them?
We can for example look at the amount of missing values, do they just have overall low expression of many genes or is it just that they are not measured at all.
Let’s take a look at the amount of missing values
## sum the amount of missing values in each column (sample)
apply(expr_data, 2, function(x) sum(is.na(x)))
## X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13
## 27 26 41 25 32 30 24 32 14462 36 26 33 23
## X14 X15 X16 X17 X18 X19 X20 X21 X22 X23 X24 X25 X26
## 29 23 30 24 26 27 28 32 35 17 29 17 13151
## X27 X28 X29 X30 X31 X32 X33 X34 X35 X36 X37 X38 X39
## 33 21 38 25 31 21 32 23 24 19 35 17 39
## X40 X41 X42 X43 X44 X45 X46 X47 X48 X49 X50 X51 X52
## 31 23 34 28 30 26 23 24 27 30 29 27 25
## X53 X54 X55 X56
## 19 18 24 23
We an see that more than 2/3 of the probes have a missing value. That is not a good sign and will definitely affect our normalization statistics. Therefore we can remove these samples.
## select column numbers with more than 10000 missing values
bad_samples <- which(apply(expr_data, 2, function(x) sum(is.na(x))) > 10000)
## remove such samples from our tables
expr_data <- expr_data[,-bad_samples]
sample_data <- sample_data[-bad_samples,]
We can check back at the boxplot to see if we have removed the correct ones.
boxplot(expr_data)
Trashing samples is a simple method. It might be possible or not depending on the amount of samples that one has. Moreover, carefully consider if a bad sample is really due to technical error, perhaps there is some biology behind, and throwing away such samples might add bias to our experiment.
Now we can actually normalize the data. There are several methods to do so, some more correct than others depending on the experiment design and the data type. The important concept here is that we want to make the measurements of genes comparable between samples.
For example, if geneA has an expression of 2 in sampleA and an expression of 4 in sampleB; we want to be able to say that sampleB has double the expression because it is real and not because sampleB had double the input of mRNA in the microarray.
And finally, normalization implies that we have some assumptions about the data and the experiment. Different normalization approaches imply different assumptions, and these can help us decide which method fits best for our experiment.
What would be some fair assumptions about this experiment?
All the samples should have a comparable total amount of mRNA.
All the samples should have similar distributions.
With these assumptions mean substraction, median subsetraction or quantiles normalization would work. Furthermore, it is quite established that for microarray data quantiles normalization works well. Nevertheless we can explore what would these normalizations do to our data distributions.
sample_mean <- apply(expr_data, 2, mean, na.rm = T)
boxplot(sweep(x = expr_data, MARGIN = 2, STATS = sample_mean, FUN = '-'))
sample_median <- apply(expr_data, 2, median, na.rm = T)
boxplot(sweep(x = expr_data, MARGIN = 2, STATS = sample_median, FUN = '-'))
boxplot(normalize.quantiles(as.matrix(expr_data)))
norm_data <- normalize.quantiles(as.matrix(expr_data))
Batch effects are changes in the biological data that come from external factors. Different personnel performed the experiment, different machines, different days, different reagents, … These are important to correct, since they can have an important effect on the data, and lead us to the wrong conclusions. For this reason, it is important to check if we see any of these effects on the data. But even more important, is to know if we should expect these effects since we have designed and performed the experiment. And even if we have not done the experiment personally, we should know how it has been done and what is its design.
In the first boxplots that we made, we saw that there is some batch effect going on. We also know which are the samples that have been processed in the same batch from the sample_data data.frame.
After we have normalized the data, it seems like the batch effect is gone. But is it really gone? Other approaches to check for batch effects are unsupervised clustering and dimensionality reduction. We will not go into detail about explaining these since they are not the objective of this notebook. They are actually explained in depth in the PCA and Clustering notebook.
Quick explanation about PCA: each gene is a dimension, to plot all the data in one go we would need a ~25000 dimensional plot (which is of course impossible). To solve that, we join all the genes that behave similarly in one dimension. In the plots, each dot is a sample, similar samples will be together in the 2D space; and different samples will be separated. Keep in mind that these 2 dimensions are a subset of all the data, which means that although two samples might seem far away in a dimension they might be close in another dimension. PCAs are also ordered by the amount of variance that they explain, therefore PC1 will show the most variance, then PC2, PC3…
Quick explanation about unsupervised clustering: we will calculate the distance between a sample and all the others samples. Samples are like points in ~25000 dimensional space, therefore we can calculate their distance like this. We can do this for all the samples and we can see which ones are most similars and which are not. Notice that the formula we are using is for Euclidean distance, there are other distances and each one has its purpouses and assumptions.
We can do a quick Principle Components Analysis (PCA) to see how the samples cluster together (similar samples should be together).
What do you expect to see in the PCA?
Samples from the same CMS should cluster together.
Triplicates within the CMS cluster should be closer together.
We will do a PCA and plot the scores of each sample and color them by CMS.
## calculate the PCA components
pca_res <- pca(t(norm_data))
## plot PC1 and PC2 and color by CMS subtype
plot(pca_res@scores[,1], pca_res@scores[,2], col = sample_data$CMS)
And the same plot colored by batch.
## color by batch
plot(pca_res@scores[,1], pca_res@scores[,2], col = sample_data$batch)
What do these PCA plots show?
We can also calculate what is the distance between all the samples to see which ones are most similar and which are most different.
## calculate the distance matrix between samples
distance_matrix <- as.matrix(dist(t(norm_data), method = "euclidean", upper = TRUE, diag = TRUE))
cms_colors <- c('red', 'blue', 'green', 'black')
batch_colors <- c('red', 'blue', 'green')
cms_colors_col <- cms_colors[sample_data$CMS]
batch_colors_col <- cms_colors[sample_data$batch]
heatmap.2(x = distance_matrix,
trace = 'none',
ColSideColors = cms_colors_col)
The more red the color, the more similar two samples are with each other. The colors at the top display the CMS subtype. Do you see any clusters?
heatmap.2(x = distance_matrix,
trace = 'none',
ColSideColors = batch_colors_col)
What do these heatmaps show?
To correct the batch effect several methods have been implemented. For microarray data particularly, the ComBat method has been shown to work particularly well. But for other types of data, single-cell rna-seq data there are other methods.
Notice that batch correction is a supervised approach, meaning that we need to know a priori which are the different batches.
batch_corr_data <- sva::ComBat(norm_data, batch = sample_data$batch)
## Found3batches
## Adjusting for0covariate(s) or covariate level(s)
## Found1471Missing Data Values
## Standardizing Data across genes
## Fitting L/S model and finding priors
## Finding parametric adjustments
## Adjusting the Data
Now try to explore again the batch corrected data and see what happened.
pca_res <- pca(t(batch_corr_data))
plot(pca_res@scores[,1], pca_res@scores[,2], col = sample_data$CMS)
plot(pca_res@scores[,1], pca_res@scores[,2], col = sample_data$batch)
distance_matrix <- as.matrix(dist(t(batch_corr_data), upper = TRUE, diag = TRUE))
heatmap.2(x = distance_matrix,
trace = 'none',
ColSideColors = cms_colors_col)
heatmap.2(x = distance_matrix,
trace = 'none',
ColSideColors = batch_colors_col)
Scaling is also part of the normalization process. Scaling can be done on the samples and/or on the features.
Scaling of the samples is done to make datasets compatible. If one dataset is in log2 scale, but the other is in log10 scale they are not compatible because the same feature will have significant different ranges depending on the dataset.
Feature importance. It is important that features are all in a similar (if not the same) scale. For example, before we calculated the Euclidean distance, in it we calculated the difference between expression of genes. Let’s consider that a geneA goes from expression of 1 to 1.2; in the same sample geneB goes from 0.1 to 0.2. geneB has quite a change in gene expression because it is doubled between samples; while geneA has a 20% increase. In Euclidean space now the difference is 0.2 and 0.1, making the change in expression of geneA the larger one. Therefore consider, for your particular experiment, which features should be the important ones, if there should be any at all.
In our example we already had the data a scaled a bit since we are working with log transformed values. Therefore making genes with really high expression less valuable compared to lower expressed genes. But we might consider scaling all the genes between 0-1, so that we only compare relative expression between samples.
There are several scaling methods, this article has a nice table with several scaling methods that can be used.
Missing values are quite common and it is important to understand why they exist in our dataset. Again, consider the type of data and the experiment design when thinking about this.
These are the different types:
NA value. It is important to always check our data before and after each processing step, and make sure that everything is as we expect it to be.There are different types of missing values, it is crucial to be able to differentiate between them whenever possible since the methods that are used to deal with each type are different. Sadly, most of the times we have a combination of these different types and it can be difficult to confidently tell which on is which.
In R missing values are usually denoted as NA, but it might be different in other platforms. Consider also if 0 is a missing value or means that it was not truly measured.
Let’s explore how many missing values we have.
sum(is.na(expr_data))
## [1] 1471
And how many non missing values.
sum(!is.na(expr_data))
## [1] 1118597
So there are not so many missing values. Still we want to decide what to do with them. So let’s try to find out which type of missing values are they.
How would you check which kind of missing values are they?
See if there are any samples with a lot/very few missing values.
See if there are any features with a lot/very few missing values.
See in which part of the distribution are the missing values.
Missing values per sample.
apply(expr_data, 2, function(x) sum(is.na(x)))
## X1 X2 X3 X4 X5 X6 X7 X8 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20 X21
## 27 26 41 25 32 30 24 32 36 26 33 23 29 23 30 24 26 27 28 32
## X22 X23 X24 X25 X27 X28 X29 X30 X31 X32 X33 X34 X35 X36 X37 X38 X39 X40 X41 X42
## 35 17 29 17 33 21 38 25 31 21 32 23 24 19 35 17 39 31 23 34
## X43 X44 X45 X46 X47 X48 X49 X50 X51 X52 X53 X54 X55 X56
## 28 30 26 23 24 27 30 29 27 25 19 18 24 23
hist(apply(expr_data, 2, function(x) sum(is.na(x))))
Seems that most samples have between 30-35 missing values. We actually already took care of two samples before that had a lot of missing values. So it doesn’t look like there is anything too worrying here. If all the missing values where in one sample for example, then we might want to further investigate that.
Missing values per gene.
table(apply(expr_data, 1, function(x) sum(is.na(x))))
##
## 0 1 2 15 18 19 20
## 20211 472 8 1 6 20 24
Most genes have 0 missing values and then there are some have from 1 to 20.
Notice anything strange about these values?
Perhaps by using a heatmap we can spot what is happening.
We can use colors to tell the different subtypes apart: - Orange: CMS1 - Blue: CMS2 - Green: CMS3 - Yellow: CMS4
colors <- c('CMS1' = "#E69F00", 'CMS2' = "#56B4E9",
'CMS3' = "#009E73", 'CMS4' = "#F0E442")
colors_vec <- sapply(sample_data$CMS, function(x) {
return(colors[as.character(x)])
})
## reduce the plot to only genes that at least 1 missing value
genes_with_nas <- which(apply(expr_data, 1, function(x) sum(is.na(x))) > 0)
reduced_mat <- as.matrix(expr_data[genes_with_nas,])
heatmap.2(reduced_mat, Rowv = NULL, Colv = NULL,
dendrogram = 'none', scale = 'none', trace = 'none',
ColSideColors = colors_vec, na.color = 'grey30')
What are your first impressions?
We can focus only on the genes that have between 1 and 5 missing values.
genes_with_nas <- which(apply(expr_data, 1, function(x) sum(is.na(x))) > 0 &
apply(expr_data, 1, function(x) sum(is.na(x))) < 6)
reduced_mat <- as.matrix(expr_data[genes_with_nas,])
heatmap.2(reduced_mat, Rowv = NULL, Colv = NULL,
dendrogram = 'none', scale = 'none', trace = 'none',
ColSideColors = colors_vec, na.color = 'grey30')
What are your first impressions?
We can check the genes with more than 5 missing values
genes_with_nas <- which(apply(expr_data, 1, function(x) sum(is.na(x))) > 5)
reduced_mat <- as.matrix(expr_data[genes_with_nas,])
heatmap.2(reduced_mat, Rowv = FALSE, Colv = FALSE,
dendrogram = 'none', scale = 'none', trace = 'none',
ColSideColors = colors_vec, na.color = 'grey30')
What are your first impressions?
So far we have explored our missing values. It looks like the majority of them are random, with emphasis at missing at random but with a detection limit component. Finally there is one gene that has not at random missing values.
Now we have to decide what do we do with these missing values. Again, depending on the type of data and experiment we might want to keep the missing values, remove them or impute them. It is important to decide whether the missingness of a value has information or not.
Some options of what we can do:
What would you do?
Dealing with missing values is quite complex and is extremely dependent on the experiment. In this particular case our final objective is to build a predictor, a predictor needs to be robust, therefore using genes that are sometimes detected and sometimes not is not a good idea. Since most of the values seem random imputation based on the most similar sample would probably work fine. On the other hand, since there are not so many genes that actually have missing values we could also not use them.
We can also try different methods, and then see how the data looks like; specially in this case since we are adding data that will be as important as the real data we want to check that we are not creating any aberrations.
Removing missing genes.
genes_to_remove <- which(apply(expr_data, 1, function(x) sum(is.na(x))) > 0)
reduced_data <- expr_data[-genes_to_remove,]
Setting missing values as the sample mean.
mean_col <- apply(expr_data, 1, mean, rm.na = TRUE)
imputed_data <- expr_data
for (i in seq_len(ncol(imputed_data))) {
col_data <- imputed_data[,i]
col_data[is.na(col_data)] <- mean_col[i]
imputed_data[,i] <- col_data
}
Setting missing values as a lower quantile value.
quantile(expr_data, na.rm = T)
## 0% 25% 50% 75% 100%
## 2.053516 3.555102 4.709975 7.330005 17.865068
imputed_data <- expr_data
imputed_data[is.na(imputed_data)] <- quantile(expr_data, na.rm = T)[1]
norm_imputed <- normalize.quantiles(as.matrix(imputed_data))
norm_imputed_batchcorr <- sva::ComBat(norm_imputed, batch = sample_data$batch)
## Found3batches
## Adjusting for0covariate(s) or covariate level(s)
## Standardizing Data across genes
## Fitting L/S model and finding priors
## Finding parametric adjustments
## Adjusting the Data
This part will very briefly touch on aleviating the amount of data given to to train/predict. We have measured ~25000 genes, but to predict the different subtypes just a few of them are necessary. In this follow-up paper they built a classifier with 113 probes. Therefore it would reduce the amount of computation time and difficulty for the algorithms if we could a priori reduce the amount of features. There is also the consideration of robustness, reducing the amount of required features is nice, but reducing it too much might make our method too unstable, having some redundancy is also nice.
This part is highly dependent on the machine learning method used, some methods have feature selection already incorporated in the algorithm itself, while some others do not. You can read about it here.
Nevertheless, let’s say that our algorithm does not have feature selection.
IMPORTANT: feature selection is part of ML, therefore we must apply cross-validation methods to ensure that we are not overestimating the performance of our model.
What kind of metrics could we used to reduce the amount of features?
Features that have near zero variance: if the expression of a gene does not change across samples it will not be informative.
Correlated features: if two or more genes have a highly correlated expression, they contain the same information.
Linear dependencies: if the expression of a gene can be calculated by the mathematical combination of two or more other genes.
We can check how many features have low variance.
hist(apply(batch_corr_data, 1, var, na.rm = T), xlim = c(0, 5), breaks = 100, main = "Variance")
As we can see, many of them have very low variance, these are not very informative.
Does higher variance correlate with better class separation?
In this case we could remove genes that have less than a certain variance threshold.
relevant_genes <- which(apply(batch_corr_data, 1, var, na.rm = T) > 0.5)
batch_corr_data <- batch_corr_data[relevant_genes,]
length(relevant_genes)
## [1] 4528
We can try different combinations of the above described methods (normalization, scaling, dealing with missing values) and see what is their effect on training a model.
To get some reliable values we will do a simple quick cross-validation by setting up a subset of the samples separated. Let’s keep for example the 3rd batch out.
x_train <- norm_imputed_batchcorr[, sample_data$batch %in% c(1, 2)]
x_validation <- norm_imputed_batchcorr[, sample_data$batch %in% c(3)]
y_train <- sample_data$CMS[sample_data$batch %in% c(1, 2)]
y_validation <- sample_data$CMS[sample_data$batch %in% c(3)]
model <- build_model(x_train, y_train)
predictions <- predict_new_data(x_validation, model)
performance(y_validation, predictions)
## labels
## predictions CMS1 CMS2 CMS3 CMS4
## CMS1 4 0 0 0
## CMS2 0 5 0 0
## CMS3 0 0 3 0
## CMS4 0 0 0 6
Let’s try with other datasets, for example if we use norm_imputed that has not been batch corrected, will it work? You can try other states of the dataset, unfortunately this machine learning method does not deal automatically with missing values, so you will have to take care of them first.
x_train <- norm_imputed[, sample_data$batch %in% c(1, 2)]
x_validation <- norm_imputed[, sample_data$batch %in% c(3)]
y_train <- sample_data$CMS[sample_data$batch %in% c(1, 2)]
y_validation <- sample_data$CMS[sample_data$batch %in% c(3)]
model <- build_model(x_train, y_train)
predictions <- predict_new_data(x_validation, model)
performance(y_validation, predictions)
## labels
## predictions CMS1 CMS2 CMS3 CMS4
## CMS1 0 0 0 0
## CMS2 0 0 0 0
## CMS3 0 0 0 0
## CMS4 4 5 3 6
x_train <- norm_imputed[, sample_data$batch %in% c(1, 2)]
x_validation <- norm_imputed[, sample_data$batch %in% c(3)]
y_train <- sample_data$CMS[sample_data$batch %in% c(1, 2)]
y_validation <- sample_data$CMS[sample_data$batch %in% c(3)]
y_train <- sample(y_train, length(y_train))
model <- build_model(x_train, y_train)
predictions <- predict_new_data(x_validation, model)
performance(y_validation, predictions)
## labels
## predictions CMS1 CMS2 CMS3 CMS4
## CMS1 0 0 0 0
## CMS2 0 1 0 0
## CMS3 0 0 0 0
## CMS4 4 4 3 6
In this notebook we have learned about some of the key aspects of data pre-processing, exploration and cleaning. Keep in mind, that the methods that we have applied here are particularly suited for this kind of data and experiment. It is of extreme importance to research which are the adequate methods for each type of dataset and type of experiment.